library(tidyverse)
if (file.exists("../startup.R")) source("../startup.R") # for lecture slides only
ggplot
In this notebook, we will cover:
Recall the diamonds data set.
print(diamonds)
Let us create a bar chart using the cut variable. Recall that this is done by the geom_bar geometry.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
This aesthetic mapping looks different from what we have seen before. We didn't supply a y variable, and the y variable in the plot, count, is not even a variable in our data set.
So we see that a bar chart differs quite fundamentally from a scatter plot. A scatter plot uses the raw data variables directly. A bar chart applies a statistical transformation (stat_count in our case) to create the counts and then plots the counts vs the raw variable (cut in our case).
How do we know that geom_bar uses stat_count as the default transformation? You can type ?geom_bar in RStudio or consult the online documentation.
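To make the statistical transformation concrete, here is a minimal sketch that counts the diamonds per cut by hand and compares the result with what ggplot computed for the bar layer. It assumes the layer_data() helper from ggplot2 to extract the computed data; the names manual_counts, p_counts and computed are only illustrative.

```r
library(ggplot2)

# Count the diamonds in each cut category by hand (base R).
manual_counts <- as.vector(table(diamonds$cut))

# Build the bar chart and extract the data ggplot2 computed for its layer.
p_counts <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
computed <- layer_data(p_counts)

# The computed `count` column should agree with the hand-made table.
print(data.frame(
  cut = levels(diamonds$cut),
  manual = manual_counts,
  stat_count = computed$count
))
```

The two count columns should be identical, since the bar heights are nothing more than these counts.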
Geometries and statistical transformations come in pairs. E.g.,
- geom_point and "identity"
- geom_smooth and stat_smooth
- geom_bar and stat_count
These are the defaults. Although it is usually unnecessary, they can be overridden.
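As a small illustration of this pairing, the sketch below draws the same bar chart starting from the stat side, and then overrides the default geometry while keeping the counting. The object names (p_geom, p_stat, p_points) are made up for this example.

```r
library(ggplot2)

# Default pairing: geom_bar() uses stat_count() behind the scenes.
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The same plot written from the stat side: stat_count() draws bars by default.
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

# Overriding the default: keep the counting, but draw points instead of bars.
p_points <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut), geom = "point")

p_points
```

All three plots are built from the same computed counts; only the geometry used to draw them differs.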
Let's create a simple synthetic data set. To do this we use the tribble command, which is a streamlined way to hand-enter data in R. We'll cover it later on.
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
print(demo)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
geom_col is used when you want the bar heights to represent values in the data. E.g., the plot below shows the total price (in millions of dollars) of the diamonds in each cut category.
ggplot(data = diamonds) +
geom_col(mapping = aes(x = cut, y = price / 1e6))
A lot is happening internally when geom_col creates the plot above: it stacks the individual prices within each cut category on top of each other, so the bar heights are totals. One way to generate the same plot with explicit data transformations is as follows. Don't worry about the details; we will cover data transformations (like select, group_by, summarise) later.
my_tibble <- select(diamonds, cut, price) %>%
group_by(cut) %>%
summarise(total_price_millions = sum(price)/1e6)
print(my_tibble)
ggplot(data = my_tibble) +
geom_bar(mapping = aes(x = cut, y = total_price_millions), stat = "identity")
Let us see what happens when we map the color aesthetic to the cut variable.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))
It just changed the boundary color. Using the fill aesthetic might be better.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
What happens when we map the fill aesthetic to some variable other than cut, say clarity?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
If stacking is not the behavior you want, you can set the position argument to something other than "stack".
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") # stacks but shows proportions
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") # put the bars side by side
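To see what position = "fill" is showing, here is a rough sketch that computes the within-cut proportions by hand using base R. The names counts and props are only illustrative.

```r
library(ggplot2)

# Cross-tabulate clarity (rows) by cut (columns).
counts <- table(diamonds$clarity, diamonds$cut)

# margin = 2 normalises within each column, i.e. within each cut.
props <- prop.table(counts, margin = 2)

# Each column sums to 1, which is why every bar has the same height.
print(round(props, 3))
print(colSums(props))
```

The segments of each bar in the "fill" plot correspond to one column of this table.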
There is another position adjustment, "jitter", that is less useful in bar plots but can be useful in scatter plots.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
It's difficult to see but there are overlapping points in the plot above. For example, the lowest point (point with the lowest hwy value) actually consists of 5 overlapping points: 2 SUVs and 3 pickup trucks.
filter(mpg, hwy == 12)
Let's see if the overlapping points show up when we use the jitter position adjustment in a normal scatter plot.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter is just a shorthand for geom_point(position = "jitter")
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
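If the default amount of noise is too much or too little, the position_jitter() function lets you control it. The sketch below is one possible setting; the width, height and seed values are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# width and height give the maximum shift in each direction (in data units);
# seed makes the random noise reproducible from run to run.
p_jit <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.3, seed = 1)
  )

p_jit
```

Keeping the jitter small matters here, because the added noise changes where the points are drawn.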
Returning to the bar graph from above:
(bar <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1)) # set width so that the bars touch
In some cases a horizontal bar chart might be desirable. This amounts to simply flipping the $x$ and $y$ axes. Using ggplot(), this becomes a one-liner:
bar + coord_flip()
A pie chart is a circular chart where the angle of each wedge is proportional to the frequency of each category. To get a pie chart, we first create a stacked bar chart:
(bar_stacked <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1)) # set width so that the bars touch
Then, we plot it in polar coordinates, assigning the angle value $\theta$ to $y$:
bar_stacked +
labs(x = NULL) + # remove the x axis label "factor(1)"
coord_polar(theta = "y") # change to polar coordinates to get a pie chart
# "y" (in quotes) is required here
Notice that we mapped the y axis of the bar chart to the angle theta. By default, coord_polar() maps the x axis to the angle and the y axis to the radial coordinate, giving us a bullseye chart.
bar_stacked +
labs(x = NULL) + # remove the x axis label "factor(1)"
coord_polar() # default polar coordinates give a bullseye chart
A Coxcomb chart is another way to represent data in polar coordinates. Instead of making the angle of each wedge proportional to frequency, we give every category a wedge of the same angle and let the radius of the wedge grow with frequency.
bar +
labs(x = NULL) + # remove the x axis label "cut"
coord_polar() # change to polar coordinates to get a Coxcomb chart
ggplot syntax
Now you have learned about all elements of the following ggplot template:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
ggplot topics
Here are a few additional topics that you might find useful as you delve deeper into ggplot.
You can easily produce histograms and kernel density estimates using the same approach.
ggplot(diamonds, aes(x=carat, y=after_stat(density))) + geom_histogram() # after_stat() replaces the deprecated ..density.. notation
Choice of binwidth is very important when plotting histograms. By default, ggplot uses 30 bins and prints a message asking you to pick a better value, which forces you to think about this choice:
p = ggplot(diamonds, aes(x=carat, y=after_stat(density)))
p1 = p + geom_histogram(binwidth=.01)
p2 = p + geom_histogram(binwidth=.1)
p3 = p + geom_histogram(binwidth=.5)
gridExtra::grid.arrange(p1, p2, p3, ncol=3)
Here I also used the grid.arrange command in the gridExtra package to combine multiple plots.
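If you want a starting point rather than trial and error, one common rule of thumb (not the only one, and not something ggplot applies for you) is the Freedman-Diaconis rule, which sets the binwidth to 2 * IQR / n^(1/3). Here is a minimal sketch; fd_binwidth is a helper name made up for this example.

```r
library(ggplot2)

# Freedman-Diaconis rule of thumb: binwidth = 2 * IQR(x) / n^(1/3).
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1/3)
}

bw <- fd_binwidth(diamonds$carat)
print(bw)

# Use the suggested binwidth as a first guess, then adjust by eye.
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = bw)
```

Treat the result as a first guess only; it is still worth comparing a few nearby values as we did above.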
p = (p + geom_histogram(color = "grey30", fill = "white") + geom_density() )
Adding labels and titles is extremely important when publishing plots. An uninterpretable plot is worse than no plot at all!
(p = p + ggtitle("Carat Distribution") + xlab("Carat") + ylab("Density") + xlim(c(0,2.5)))
The default look of ggplot is an acquired taste (to some). Fortunately, almost every aspect of the appearance can be configured. Even more fortunately, people have done this in a variety of styles. We'll use the ggthemes package to quickly switch between different themes.
# install.packages("ggthemes")
library(ggthemes)
p + theme_tufte() + annotate("text", x=1.5, y=1.5, label="Classy!")
Sometimes histograms can be complemented with markers for mean, etc.:
p + geom_vline(xintercept = mean(diamonds$carat), linetype="dashed", color="red")
Some data are more naturally plotted on a different scale. To accomplish this in ggplot, we use the scale_* functions.
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() +
scale_x_log10("Engine Displacement") +
scale_y_continuous("Highway Mileage")
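To see what the scale function does to the data, the sketch below builds the plot and inspects the x values stored in the point layer. It assumes the layer_data() helper from ggplot2; p_log and built_x are illustrative names.

```r
library(ggplot2)

p_log <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10("Engine Displacement")

# After the plot is built, the layer holds log10(displ) rather than displ.
built_x <- layer_data(p_log)$x
print(head(cbind(displ = mpg$displ, built_x = built_x)))
```

The axis labels still show the original units, but the positions are computed on the log scale.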
In practice you will find yourself combining many of these techniques. The process of creating high-quality visuals is iterative and somewhat artistic -- there is no "one right answer". In the last few slides, I want to give you some intuition for how such a process might unfold.
data(midwest)
print(midwest)
midwest_top25 = midwest %>% select(county, state, percollege) %>%
arrange(desc(percollege)) %>% top_n(25)
print(midwest_top25)
(p = ggplot(midwest_top25, aes(x=county, y=percollege)) + geom_bar(stat="identity"))
(p = p + coord_flip())
midwest_top25 = midwest_top25 %>% mutate(county = factor(county, levels = .$county))
print(midwest_top25)
(p = ggplot(midwest_top25, aes(x=county, y=percollege)) + geom_bar(stat="identity") + coord_flip())
(p = ggplot(midwest_top25, aes(x=county, y=percollege)) + geom_point() + coord_flip())
p + geom_segment(aes(y=0, yend=percollege, x=county, xend=county))
mu = mean(midwest_top25$percollege)
midwest_top25 = midwest_top25 %>% mutate(above=percollege > mu)
p = ggplot(midwest_top25, aes(x=county, y=percollege)) + geom_point(aes(color=above)) +
coord_flip() + geom_segment(aes(y=mu, yend=percollege, x=county, xend=county))
p
p + theme_tufte() + ggtitle("College education by county") +
ylab("% College Educated") + xlab("County")